Search | BVS Bolivia

1.

Combining IC₅₀ or Kᵢ Values from Different Sources Is a Source of Significant Noise.

Landrum, Gregory A; Riniker, Sereina.

J Chem Inf Model ; 64(5): 1560-1567, 2024 Mar 11.

Article in English | MEDLINE | ID: mdl-38394344

ABSTRACT

As part of the ongoing quest to find or construct large data sets for use in validating new machine learning (ML) approaches for bioactivity prediction, it has become distressingly common for researchers to combine literature IC50 data generated using different assays into a single data set. It is well-known that there are many situations where this is a scientifically risky thing to do, even when the assays are against exactly the same target, but the risks of assays being incompatible are even higher when pulling data from large collections of literature data like ChEMBL. Here, we estimate the amount of noise present in combined data sets using cases where measurements for the same compound are reported in multiple assays against the same target. This approach shows that IC50 assays selected using minimal curation settings have poor agreement with each other: almost 65% of the points differ by more than 0.3 log units, 27% differ by more than one log unit, and the correlation between the assays, as measured by Kendall's τ, is only 0.51. Requiring that most of the assay metadata in ChEMBL matches ("maximal curation") in order to combine two assays improves the situation (48% of the points differ by more than 0.3 log units, 13% by more than one log unit, and Kendall's τ is 0.71) at the expense of having smaller data sets. Surprisingly, our analysis shows similar amounts of noise when combining data from different literature Ki assays. We suggest that good scientific practice requires careful curation when combining data sets from different assays and hope that our maximal curation strategy will help to improve the quality of the data that are being used to build and validate ML models for bioactivity prediction. To help achieve this, the code and ChEMBL queries that we used for the maximal curation approach are available as open-source software in our GitHub repository, https://github.com/rinikerlab/overlapping_assays.
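Illustrative sketch (not taken from the paper or its repository): the agreement statistics quoted in the abstract can be computed from paired measurements of the same compounds in two assays. The pIC50 values below are hypothetical, and the tau-a variant of Kendall's τ is an assumption of this sketch; the authors' code may use a different variant.

```python
from itertools import combinations


def fraction_exceeding(pairs, threshold):
    """Fraction of (assay_a, assay_b) pairs whose absolute difference exceeds threshold."""
    return sum(abs(a - b) > threshold for a, b in pairs) / len(pairs)


def kendall_tau_a(pairs):
    """Kendall's tau-a: (concordant - discordant) / number of point pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(pairs)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical pIC50 values (log units) for five compounds measured in two assays.
pairs = [(6.1, 6.3), (7.4, 6.2), (5.0, 5.2), (8.2, 7.8), (6.8, 7.5)]

print("fraction differing by > 0.3 log units:", fraction_exceeding(pairs, 0.3))
print("fraction differing by > 1.0 log units:", fraction_exceeding(pairs, 1.0))
print("Kendall tau-a:", kendall_tau_a(pairs))
```

A higher fraction of large differences and a lower τ both indicate that the two assays should not be merged without further curation.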

Subject(s)

Machine Learning , Software , Biological Assay

2.

The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods.

Zdrazil, Barbara; Felix, Eloy; Hunter, Fiona; Manners, Emma J; Blackshaw, James; Corbett, Sybilla; de Veij, Marleen; Ioannidis, Harris; Lopez, David Mendez; Mosquera, Juan F; Magarinos, Maria Paula; Bosc, Nicolas; Arcila, Ricardo; Kizilören, Tevfik; Gaulton, Anna; Bento, A Patrícia; Adasme, Melissa F; Monecke, Peter; Landrum, Gregory A; Leach, Andrew R.

Nucleic Acids Res ; 52(D1): D1180-D1192, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-37933841

ABSTRACT

ChEMBL (https://www.ebi.ac.uk/chembl/) is a manually curated, high-quality, large-scale, open, FAIR and Global Core Biodata Resource of bioactive molecules with drug-like properties, previously described in the 2012, 2014, 2017 and 2019 Nucleic Acids Research Database Issues. Since its introduction in 2009, ChEMBL's content has changed dramatically in size and diversity of data types. Through incorporation of multiple new datasets from depositors since the 2019 update, ChEMBL now contains slightly more bioactivity data from deposited data vs data extracted from literature. In collaboration with the EUbOPEN consortium, chemical probe data is now regularly deposited into ChEMBL. Release 27 made curated data available for compounds screened for potential anti-SARS-CoV-2 activity from several large-scale drug repurposing screens. In addition, new patent bioactivity data have been added to the latest ChEMBL releases, and various new features have been incorporated, including a Natural Product likeness score, updated flags for Natural Products, a new flag for Chemical Probes, and the initial annotation of the action type for ~270 000 bioactivity measurements.

Subject(s)

Drug Discovery , Databases, Factual , Time Factors

3.

SIMPD: an algorithm for generating simulated time splits for validating machine learning approaches.

Landrum, Gregory A; Beckers, Maximilian; Lanini, Jessica; Schneider, Nadine; Stiefl, Nikolaus; Riniker, Sereina.

J Cheminform ; 15(1): 119, 2023 Dec 11.

Article in English | MEDLINE | ID: mdl-38082357

ABSTRACT

Time-split cross-validation is broadly recognized as the gold standard for validating predictive models intended for use in medicinal chemistry projects. Unfortunately this type of data is not broadly available outside of large pharmaceutical research organizations. Here we introduce the SIMPD (simulated medicinal chemistry project data) algorithm to split public data sets into training and test sets that mimic the differences observed in real-world medicinal chemistry project data sets. SIMPD uses a multi-objective genetic algorithm with objectives derived from an extensive analysis of the differences between early and late compounds in more than 130 lead-optimization projects run within the Novartis Institutes for BioMedical Research. Applying SIMPD to the real-world data sets produced training/test splits which more accurately reflect the differences in properties and machine-learning performance observed for temporal splits than other standard approaches like random or neighbor splits. We applied the SIMPD algorithm to bioactivity data extracted from ChEMBL and created 99 public data sets which can be used for validating machine-learning models intended for use in the setting of a medicinal chemistry project. The SIMPD code and simulated data sets are available under open-source/open-data licenses at github.com/rinikerlab/molecular_time_series.

4.

DASH: Dynamic Attention-Based Substructure Hierarchy for Partial Charge Assignment.

Lehner, Marc T; Katzberger, Paul; Maeder, Niels; Schiebroek, Carl C G; Teetz, Jakob; Landrum, Gregory A; Riniker, Sereina.

J Chem Inf Model ; 63(19): 6014-6028, 2023 Oct 09.

Article in English | MEDLINE | ID: mdl-37738206

ABSTRACT

We present a robust and computationally efficient approach for assigning partial charges of atoms in molecules. The method is based on a hierarchical tree constructed from attention values extracted from a graph neural network (GNN), which was trained to predict atomic partial charges from accurate quantum-mechanical (QM) calculations. The resulting dynamic attention-based substructure hierarchy (DASH) approach provides fast assignment of partial charges with the same accuracy as the GNN itself, is software-independent, and can easily be integrated in existing parametrization pipelines, as shown for the Open force field (OpenFF). The implementation of the DASH workflow, the final DASH tree, and the training set are available as open source/open data from public repositories.

5.

Incorporating NOE-Derived Distances in Conformer Generation of Cyclic Peptides with Distance Geometry.

Wang, Shuzhe; Krummenacher, Kajo; Landrum, Gregory A; Sellers, Benjamin D; Di Lello, Paola; Robinson, Sarah J; Martin, Bryan; Holden, Jeffrey K; Tom, Jeffrey Y K; Murthy, Anastasia C; Popovych, Nataliya; Riniker, Sereina.

J Chem Inf Model ; 62(3): 472-485, 2022 Feb 14.

Article in English | MEDLINE | ID: mdl-35029985

ABSTRACT

Nuclear magnetic resonance (NMR) data from NOESY (nuclear Overhauser enhancement spectroscopy) and ROESY (rotating frame Overhauser enhancement spectroscopy) experiments can easily be combined with distance geometry (DG) based conformer generators by modifying the molecular distance bounds matrix. In this work, we extend the modern DG based conformer generator ETKDG, which has been shown to reproduce experimental crystal structures from small molecules to large macrocycles well, to include NOE-derived interproton distances. In noeETKDG, the experimentally derived interproton distances are incorporated into the distance bounds matrix as loose upper (or lower) bounds to generate large conformer sets. Various subselection techniques can subsequently be applied to yield a conformer bundle that best reproduces the NOE data. The approach is benchmarked using a set of 24 (mostly) cyclic peptides for which NOE-derived distances as well as reference solution structures obtained by other software are available. With respect to other packages currently available, the advantages of noeETKDG are its speed and that no prior force-field parametrization is required, which is especially useful for peptides with unnatural amino acids. The resulting conformer bundles can be further processed with the use of structural refinement techniques to improve the modeling of the intramolecular nonbonded interactions. The noeETKDG code is released as a fully open-source software package available at www.github.com/rinikerlab/customETKDG.

Subject(s)

Peptides, Cyclic , Peptides , Magnetic Resonance Imaging , Magnetic Resonance Spectroscopy/methods , Models, Molecular , Protein Conformation

6.

GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning.

Esposito, Carmen; Landrum, Gregory A; Schneider, Nadine; Stiefl, Nikolaus; Riniker, Sereina.

J Chem Inf Model ; 61(6): 2623-2640, 2021 Jun 28.

Article in English | MEDLINE | ID: mdl-34100609

ABSTRACT

Machine learning classifiers trained on class imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5 which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific for random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure-activity data for a variety of pharmaceutical targets. We show that both thresholding methods improve significantly the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
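Illustrative sketch (not the published GHOST procedure): the core idea of moving the decision threshold away from 0.5 without retraining can be shown by scanning candidate thresholds over predicted probabilities and keeping the one with the best agreement score. Using Cohen's kappa as the objective, the candidate grid, and the toy labels and probabilities are all assumptions of this sketch.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels (1 = minority/active, 0 = majority/inactive)."""
    n = len(y_true)
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


def pick_threshold(y_true, probs, candidates):
    """Return the candidate threshold with the highest kappa (first one wins on ties)."""
    best_threshold, best_kappa = None, float("-inf")
    for threshold in candidates:
        preds = [1 if p >= threshold else 0 for p in probs]
        kappa = cohen_kappa(y_true, preds)
        if kappa > best_kappa:
            best_threshold, best_kappa = threshold, kappa
    return best_threshold


# Toy imbalanced data: 8 inactives, 2 actives, with hypothetical predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.45, 0.30, 0.40]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]

default_preds = [1 if p >= 0.5 else 0 for p in probs]
print("kappa at default 0.5:", cohen_kappa(y_true, default_preds))
print("selected threshold:", pick_threshold(y_true, probs, candidates))
```

With the default threshold every compound is predicted inactive, so kappa is zero; lowering the threshold recovers both actives at the cost of one false positive. In practice the threshold should be chosen on training or validation predictions only, never on the test set.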

Subject(s)

Algorithms , Machine Learning

7.

rdScaffoldNetwork: The Scaffold Network Implementation in RDKit.

Kruger, Franziska; Stiefl, Nikolaus; Landrum, Gregory A.

J Chem Inf Model ; 60(7): 3331-3335, 2020 Jul 27.

Article in English | MEDLINE | ID: mdl-32584031

ABSTRACT

We present an implementation of the scaffold network in the open source cheminformatics toolkit RDKit. Scaffold networks have been introduced in the literature as a powerful method to navigate and analyze large screening data sets in medicinal chemistry. Such a network can be created by iteratively applying predefined fragmentation rules to the investigated set of small molecules and by linking the produced fragments according to their descendence. This procedure results in a network graph, where the nodes correspond to the fragments and the edges correspond to the operations producing one fragment from another. In extension to the scaffold network implementations suggested in the literature, the presented implementation in RDKit allows an enhanced flexibility in terms of customizing the fragmentation rules and enables the inclusion of atom- and bond-generic scaffolds into the network. The output, providing node and edge information on the network, enables a simple and elegant navigation through the network, laying the basis to organize and better understand the data set being investigated.

Subject(s)

Cheminformatics , Software , Chemistry, Pharmaceutical

8.

Improving Conformer Generation for Small Rings and Macrocycles Based on Distance Geometry and Experimental Torsional-Angle Preferences.

Wang, Shuzhe; Witek, Jagna; Landrum, Gregory A; Riniker, Sereina.

J Chem Inf Model ; 60(4): 2044-2058, 2020 Apr 27.

Article in English | MEDLINE | ID: mdl-32155061

ABSTRACT

The conformer generator ETKDG is a stochastic search method that utilizes distance geometry together with knowledge derived from experimental crystal structures. It has been shown to generate good conformers for acyclic, flexible molecules. This work builds on ETKDG to improve conformer generation of molecules containing small or large aliphatic (i.e., non-aromatic) rings. For one, we devise additional torsional-angle potentials to describe small aliphatic rings and adapt the previously developed potentials for acyclic bonds to facilitate the sampling of macrocycles. However, due to the larger number of degrees of freedom of macrocycles, the conformational space to sample is much broader than for small molecules, creating a challenge for conformer generators. We therefore introduce different heuristics to restrict the search space of macrocycles and bias the sampling toward more experimentally relevant structures. Specifically, we show the usage of elliptical geometry and customizable Coulombic interactions as heuristics. The performance of the improved ETKDG is demonstrated on test sets of diverse macrocycles and cyclic peptides. The code developed here will be incorporated into the 2020.03 release of the open-source cheminformatics library RDKit.

Subject(s)

Heuristics , Peptides, Cyclic , Models, Molecular , Molecular Conformation

9.

KNIME for reproducible cross-domain analysis of life science data.

Fillbrunn, Alexander; Dietz, Christian; Pfeuffer, Julianus; Rahn, René; Landrum, Gregory A; Berthold, Michael R.

J Biotechnol ; 261: 149-156, 2017 Nov 10.

Article in English | MEDLINE | ID: mdl-28757290

ABSTRACT

Experiments in the life sciences often involve tools from a variety of domains such as mass spectrometry, next generation sequencing, or image processing. Passing the data between those tools often involves complex scripts for controlling data flow, data transformation, and statistical analysis. Such scripts are not only prone to be platform dependent, they also tend to grow as the experiment progresses and are seldom well documented, a fact that hinders the reproducibility of the experiment. Workflow systems such as KNIME Analytics Platform aim to solve these problems by providing a platform for connecting tools graphically and guaranteeing the same results on different operating systems. As an open source software, KNIME allows scientists and programmers to provide their own extensions to the scientific community. In this review paper we present selected extensions from the life sciences that simplify data exploration, analysis, and visualization and are interoperable due to KNIME's unified data model. Additionally, we name other workflow systems that are commonly used in the life sciences and highlight their similarities and differences to KNIME.

Subject(s)

Computational Biology , Software , Biological Science Disciplines , High-Throughput Nucleotide Sequencing , Image Processing, Computer-Assisted , Mass Spectrometry

10.

Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach.

Schneider, Nadine; Fechner, Nikolas; Landrum, Gregory A; Stiefl, Nikolaus.

J Chem Inf Model ; 57(8): 1816-1831, 2017 Aug 28.

Article in English | MEDLINE | ID: mdl-28715190

ABSTRACT

Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is not different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called "topic modeling" from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to "chemical topics" and investigating the relationships between those. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like "proteins", "DNA", or "steroids". 
Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.

Subject(s)

Data Mining/methods , Databases, Chemical , Algorithms

11.

Virtual-screening workflow tutorials and prospective results from the Teach-Discover-Treat competition 2014 against malaria.

Riniker, Sereina; Landrum, Gregory A; Montanari, Floriane; Villalba, Santiago D; Maier, Julie; Jansen, Johanna M; Walters, W Patrick; Shelat, Anang A.

F1000Res ; 6: 1136, 2017.

Article in English | MEDLINE | ID: mdl-28928948

ABSTRACT

The first challenge in the 2014 competition launched by the Teach-Discover-Treat (TDT) initiative asked for the development of a tutorial for ligand-based virtual screening, based on data from a primary phenotypic high-throughput screen (HTS) against malaria. The resulting Workflows were applied to select compounds from a commercial database, and a subset of those were purchased and tested experimentally for anti-malaria activity. Here, we present the two most successful Workflows, both using machine-learning approaches, and report the results for the 114 compounds tested in the follow-up screen. Excluding the two known anti-malarials quinidine and amodiaquine and 31 compounds already present in the primary HTS, a high hit rate of 57% was found.

12.

What's What: The (Nearly) Definitive Guide to Reaction Role Assignment.

Schneider, Nadine; Stiefl, Nikolaus; Landrum, Gregory A.

J Chem Inf Model ; 56(12): 2336-2346, 2016 Dec 27.

Article in English | MEDLINE | ID: mdl-28024398

ABSTRACT

When analyzing chemical reactions it is essential to know which molecules are actively involved in the reaction and which educts will form the product molecules. Assigning reaction roles, like reactant, reagent, or product, to the molecules of a chemical reaction might be a trivial problem for hand-curated reaction schemes but it is more difficult to automate, an essential step when handling large amounts of reaction data. Here, we describe a new fingerprint-based and data-driven approach to assign reaction roles which is also applicable to rather unbalanced and noisy reaction schemes. Given a set of molecules involved and knowing the product(s) of a reaction we assign the most probable reactants and sort out the remaining reagents. Our approach was validated using two different data sets: one hand-curated data set comprising about 680 diverse reactions extracted from patents which span more than 200 different reaction types and include up to 18 different reactants. A second set consists of 50 000 randomly picked reactions from US patents. The results of the second data set were compared to results obtained using two different atom-to-atom mapping algorithms. For both data sets our method assigns the reaction roles correctly for the vast majority of the reactions, achieving an accuracy of 88% and 97% respectively. The median time needed, about 8 ms, indicates that the algorithm is fast enough to be applied to large collections. The new method is available as part of the RDKit toolkit and the data sets and Jupyter notebooks used for evaluation of the new method are available in the Supporting Information of this publication.

Subject(s)

Drug Discovery , Models, Chemical , Software , Algorithms , Databases, Chemical , Drug Discovery/methods , Indicators and Reagents/chemistry , Patents as Topic

13.

Big Data from Pharmaceutical Patents: A Computational Analysis of Medicinal Chemists' Bread and Butter.

Schneider, Nadine; Lowe, Daniel M; Sayle, Roger A; Tarselli, Michael A; Landrum, Gregory A.

J Med Chem ; 59(9): 4385-402, 2016 May 12.

Article in English | MEDLINE | ID: mdl-27028220

ABSTRACT

Multiple recent studies have focused on unraveling the content of the medicinal chemist's toolbox. Here, we present an investigation of chemical reactions and molecules retrieved from U.S. patents over the past 40 years (1976-2015). We used a sophisticated text-mining pipeline to extract 1.15 million unique whole reaction schemes, including reaction roles and yields, from pharmaceutical patents. The reactions were assigned to well-known reaction types such as Wittig olefination or Buchwald-Hartwig amination using an expert system. Analyzing the evolution of reaction types over time, we observe the previously reported bias toward reaction classes like amide bond formations or Suzuki couplings. Our study also shows a steady increase in the number of different reaction types used in pharmaceutical patents but a trend toward lower median yield for some of the reaction classes. Finally, we found that today's typical product molecule is larger, more hydrophobic, and more rigid than 40 years ago.

Subject(s)

Chemistry, Pharmaceutical , Drug Industry , Patents as Topic , History, 20th Century , History, 21st Century , Workforce

14.

Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation.

Riniker, Sereina; Landrum, Gregory A.

J Chem Inf Model ; 55(12): 2562-74, 2015 Dec 28.

Article in English | MEDLINE | ID: mdl-26575315

ABSTRACT

Small organic molecules are often flexible, i.e., they can adopt a variety of low-energy conformations in solution that exist in equilibrium with each other. Two main search strategies are used to generate representative conformational ensembles for molecules: systematic and stochastic. In the first approach, each rotatable bond is sampled systematically in discrete intervals, limiting its use to molecules with a small number of rotatable bonds. Stochastic methods, on the other hand, sample the conformational space of a molecule randomly and can thus be applied to more flexible molecules. Different methods employ different degrees of experimental data for conformer generation. So-called knowledge-based methods use predefined libraries of torsional angles and ring conformations. In the distance geometry approach, on the other hand, a smaller amount of empirical information is used, i.e., ideal bond lengths, ideal bond angles, and a few ideal torsional angles. Distance geometry is a computationally fast method to generate conformers, but it has the downside that purely distance-based constraints tend to lead to distorted aromatic rings and sp² centers. To correct this, the resulting conformations are often minimized with a force field, adding computational complexity and run time. Here we present an alternative strategy that combines the distance geometry approach with experimental torsion-angle preferences obtained from small-molecule crystallographic data. The torsional angles are described by a previously developed set of hierarchically structured SMARTS patterns. The new approach is implemented in the open-source cheminformatics library RDKit, and its performance is assessed by comparing the diversity of the generated ensemble and the ability to reproduce crystal conformations taken from the crystal structures of small molecules and protein-ligand complexes.

Subject(s)

Algorithms , Models, Molecular , Stochastic Processes , Molecular Conformation , Organic Chemicals/chemistry

15.

Get Your Atoms in Order--An Open-Source Implementation of a Novel and Robust Molecular Canonicalization Algorithm.

Schneider, Nadine; Sayle, Roger A; Landrum, Gregory A.

J Chem Inf Model ; 55(10): 2111-20, 2015 Oct 26.

Article in English | MEDLINE | ID: mdl-26441310

ABSTRACT

Finding a canonical ordering of the atoms in a molecule is a prerequisite for generating a unique representation of the molecule. The canonicalization of a molecule is usually accomplished by applying some sort of graph relaxation algorithm, the most common of which is the Morgan algorithm. There are known issues with that algorithm that lead to noncanonical atom orderings as well as problems when it is applied to large molecules like proteins. Furthermore, each cheminformatics toolkit or software provides its own version of a canonical ordering, most based on unpublished algorithms, which also complicates the generation of a universal unique identifier for molecules. We present an alternative canonicalization approach that uses a standard stable-sorting algorithm instead of a Morgan-like index. Two new invariants that allow canonical ordering of molecules with dependent chirality as well as those with highly symmetrical cyclic graphs have been developed. The new approach proved to be robust and fast when tested on the 1.45 million compounds of the ChEMBL 20 data set in different scenarios like random renumbering of input atoms or SMILES round tripping. Our new algorithm is able to generate a canonical order of the atoms of protein molecules within a few milliseconds. The novel algorithm is implemented in the open-source cheminformatics toolkit RDKit. With this paper, we provide a reference Python implementation of the algorithm that could easily be integrated in any cheminformatics toolkit. This provides a first step toward a common standard for canonical atom ordering to generate a universal unique identifier for molecules other than InChI.

Subject(s)

Algorithms , Models, Molecular , Small Molecule Libraries/chemistry , Software , Stereoisomerism

16.

Corrections to "development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity".

Schneider, Nadine; Lowe, Daniel M; Sayle, Roger A; Landrum, Gregory A.

J Chem Inf Model ; 55(2): 474, 2015 Feb 23.

Article in English | MEDLINE | ID: mdl-25647286

17.

Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity.

Schneider, Nadine; Lowe, Daniel M; Sayle, Roger A; Landrum, Gregory A.

J Chem Inf Model ; 55(1): 39-53, 2015 Jan 26.

Article in English | MEDLINE | ID: mdl-25541888

ABSTRACT

Fingerprint methods applied to molecules have proven to be useful for similarity determination and as inputs to machine-learning models. Here, we present the development of a new fingerprint for chemical reactions and validate its usefulness in building machine-learning models and in similarity assessment. Our final fingerprint is constructed as the difference of the atom-pair fingerprints of products and reactants and includes agents via calculated physicochemical properties. We validated the fingerprints on a large data set of reactions text-mined from granted United States patents from the last 40 years that have been classified using a substructure-based expert system. We applied machine learning to build a 50-class predictive model for reaction-type classification that correctly predicts 97% of the reactions in an external test set. Impressive accuracies were also observed when applying the classifier to reactions from an in-house electronic laboratory notebook. The performance of the novel fingerprint for assessing reaction similarity was evaluated by a cluster analysis that recovered 48 out of 50 of the reaction classes with a median F-score of 0.63 for the clusters. The data sets used for training and primary validation as well as all python scripts required to reproduce the analysis are provided in the Supporting Information.
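Illustrative sketch (not the authors' implementation): a difference fingerprint subtracts the summed feature counts of the reactants from those of the products, so features unchanged by the reaction cancel out. The feature names below are hypothetical stand-ins for real atom-pair features, and the agent contribution described in the abstract is omitted.

```python
from collections import Counter


def difference_fingerprint(reactant_fps, product_fps):
    """Sum product feature counts, subtract reactant counts, and drop zero entries."""
    total = Counter()
    for fp in product_fps:
        total.update(fp)      # add counts contributed by each product
    for fp in reactant_fps:
        total.subtract(fp)    # remove counts contributed by each reactant
    return {feature: count for feature, count in total.items() if count != 0}


# Toy count fingerprints for an amide coupling (hypothetical features and counts).
acid = {"C=O": 1, "C-OH": 1, "C-C": 1}
amine = {"N-H": 2, "C-N": 1}
amide = {"C=O": 1, "C-N": 2, "N-H": 1, "C-C": 1}

reaction_fp = difference_fingerprint([acid, amine], [amide])
print(reaction_fp)
```

Positive entries mark features created by the reaction and negative entries mark features consumed, which is what makes the representation useful for grouping reactions by type.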

Subject(s)

Artificial Intelligence , Databases, Chemical , Models, Chemical , Cluster Analysis , Organic Chemistry Phenomena , Patents as Topic , Reproducibility of Results

18.

Using information from historical high-throughput screens to predict active compounds.

Riniker, Sereina; Wang, Yuan; Jenkins, Jeremy L; Landrum, Gregory A.

J Chem Inf Model ; 54(7): 1880-91, 2014 Jul 28.

Article in English | MEDLINE | ID: mdl-24933016

ABSTRACT

Modern high-throughput screening (HTS) is a well-established approach for hit finding in drug discovery that is routinely employed in the pharmaceutical industry to screen more than a million compounds within a few weeks. However, as the industry shifts to more disease-relevant but more complex phenotypic screens, the focus has moved to piloting smaller but smarter chemically/biologically diverse subsets followed by an expansion around hit compounds. One standard method for doing this is to train a machine-learning (ML) model with the chemical fingerprints of the tested subset of molecules and then select the next compounds based on the predictions of this model. An alternative approach would be to take advantage of the wealth of bioactivity information contained in older (full-deck) screens using so-called HTS fingerprints, where each element of the fingerprint corresponds to the outcome of a particular assay, as input to machine-learning algorithms. We constructed HTS fingerprints using two collections of data: 93 in-house assays and 95 publicly available assays from PubChem. For each source, an additional set of 51 and 46 assays, respectively, was collected for testing. Three different ML methods, random forest (RF), logistic regression (LR), and naïve Bayes (NB), were investigated for both the HTS fingerprint and a chemical fingerprint, Morgan2. RF was found to be best suited for learning from HTS fingerprints yielding area under the receiver operating characteristic curve (AUC) values >0.8 for 78% of the internal assays and enrichment factors at 5% (EF(5%)) >10 for 55% of the assays. The RF(HTS-fp) generally outperformed the LR trained with Morgan2, which was the best ML method for the chemical fingerprint, for the majority of assays. In addition, HTS fingerprints were found to retrieve more diverse chemotypes. Combining the two models through heterogeneous classifier fusion led to a similar or better performance than the best individual model for all assays. 
Further validation using a pair of in-house assays and data from a confirmatory screen--including a prospective set of around 2000 compounds selected based on our approach--confirmed the good performance. Thus, the combination of machine-learning with HTS fingerprints and chemical fingerprints utilizes information from both domains and presents a very promising approach for hit expansion, leading to more hits. The source code used with the public data is provided.
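Illustrative sketch (not from the paper's source code): the enrichment factor at 5% quoted in the abstract compares the fraction of actives in the top-scored 5% of a ranked list with the fraction of actives in the whole list. The function name, the rounding of the top-list size, and the toy ranking are assumptions of this sketch.

```python
def enrichment_factor(scores, labels, fraction=0.05):
    """Active rate among the top-scored fraction divided by the overall active rate."""
    n = len(scores)
    n_top = max(1, round(n * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_actives = sum(label for _, label in ranked[:n_top])
    total_actives = sum(labels)
    if total_actives == 0:
        raise ValueError("enrichment factor is undefined without any actives")
    return (top_actives / n_top) / (total_actives / n)


# Toy ranking: 100 compounds, 5 actives, 3 of which land in the top 5 positions.
scores = [100 - i for i in range(100)]
labels = [1 if i in {0, 2, 3, 40, 70} else 0 for i in range(100)]

ef5 = enrichment_factor(scores, labels, fraction=0.05)
print("EF(5%):", ef5)
```

A value of 1 corresponds to random selection; the maximum attainable value is 1 / fraction (20 at 5%) when the list contains at least that many actives.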

Subject(s)

High-Throughput Screening Assays/methods, Informatics/methods, Algorithms, Artificial Intelligence, Bayes Theorem, Logistic Models

19.

Heterogeneous classifier fusion for ligand-based virtual screening: or, how decision making by committee can be a good thing.

Riniker, Sereina; Fechner, Nikolas; Landrum, Gregory A.

J Chem Inf Model ; 53(11): 2829-36, 2013 Nov 25.

Article in English | MEDLINE | ID: mdl-24171408

ABSTRACT

The concept of data fusion (the combination of information from different sources describing the same object, with the expectation of generating a more accurate representation) has found application in a very broad range of disciplines. In the context of ligand-based virtual screening (VS), data fusion has been applied to combine knowledge from either different active molecules or different fingerprints to improve similarity search performance. Machine-learning (ML) methods based on fusion of multiple homogeneous classifiers, in particular random forests, have also been widely applied in the ML literature. The heterogeneous version of classifier fusion, in which the predictions from different model types are fused, has been less explored. Here, we investigate heterogeneous classifier fusion for ligand-based VS using three different ML methods, random forest (RF), naïve Bayes (NB), and logistic regression (LR), with four 2D fingerprints: atom pairs, topological torsions, RDKit fingerprint, and circular fingerprint. The methods are compared using a previously developed benchmarking platform for 2D fingerprints, which is extended to ML methods in this article. The original data sets are filtered for difficulty, and a new set of challenging data sets from ChEMBL is added. Data sets were also generated for a second use case: starting from a small set of related actives instead of diverse actives. The final fused model consistently outperforms the other approaches across the broad variety of targets studied, indicating that heterogeneous classifier fusion is a very promising approach for ligand-based VS. The new data sets together with the adapted source code for ML methods are provided in the Supporting Information.
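The abstract does not spell out the fusion rule, so the following is only a minimal Python sketch of one plausible way to fuse predictions from different model types: each model's scores are converted to ranks, every molecule keeps its best rank across models, and ties are broken by the highest score. The function names and the tie-breaking rule are assumptions made for illustration and should not be read as the authors' implementation.

```python
def ranks_from_scores(scores):
    """Return 1-based ranks, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def fuse_by_best_rank(score_lists):
    """Fuse several models' scores into a single ordering of molecule indices.

    score_lists: one list of scores per model, all in the same molecule order.
    Each molecule keeps its best (smallest) rank over all models;
    ties are broken by the highest score seen for that molecule (assumption).
    """
    n = len(score_lists[0])
    if any(len(scores) != n for scores in score_lists):
        raise ValueError("all models must score the same molecules")

    rank_lists = [ranks_from_scores(scores) for scores in score_lists]
    best_rank = [min(ranks[i] for ranks in rank_lists) for i in range(n)]
    best_score = [max(scores[i] for scores in score_lists) for i in range(n)]

    return sorted(range(n), key=lambda i: (best_rank[i], -best_score[i]))


# Hypothetical predicted probabilities for four molecules from two model types.
rf_scores = [0.2, 0.9, 0.1, 0.6]
lr_scores = [0.8, 0.3, 0.2, 0.7]
print(fuse_by_best_rank([rf_scores, lr_scores]))  # [1, 0, 3, 2]
```

Working on ranks rather than raw scores sidesteps the fact that probabilities from different model types are not necessarily calibrated on the same scale.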

Subject(s)

Algorithms, Artificial Intelligence, Data Mining, High-Throughput Screening Assays/statistics & numerical data, Proteins/chemistry, User-Computer Interface, Bayes Theorem, Benchmarking, Chemical Databases, Decision Making, Ligands, Logistic Models, Molecular Models, Proteins/agonists, Proteins/antagonists & inhibitors

20.

Similarity maps - a visualization strategy for molecular fingerprints and machine-learning methods.

Riniker, Sereina; Landrum, Gregory A.

J Cheminform ; 5(1): 43, 2013 Sep 24.

Article in English | MEDLINE | ID: mdl-24063533

ABSTRACT

Fingerprint similarity is a common method for comparing chemical structures. Similarity is an appealing approach because, with many fingerprint types, it provides intuitive results: a chemist looking at two molecules can understand why they have been determined to be similar. This transparency is partially lost with the fuzzier similarity methods that are often used for scaffold hopping and tends to vanish completely when molecular fingerprints are used as inputs to machine-learning (ML) models. Here we present similarity maps, a straightforward and general strategy to visualize the atomic contributions to the similarity between two molecules or the predicted probability of an ML model. We show the application of similarity maps to a set of dopamine D3 receptor ligands using atom-pair and circular fingerprints as well as two popular ML methods: random forests and naïve Bayes. An open-source implementation of the method is provided.
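As a rough illustration of what an "atomic contribution to the similarity" can mean, the sketch below assigns each atom a weight equal to the drop in similarity when the fingerprint features involving that atom are removed. This removal scheme, the set-based toy fingerprint, and all names in the code are assumptions made for illustration; the abstract itself does not describe the computation. For real molecules, RDKit ships an implementation in `rdkit.Chem.Draw.SimilarityMaps`.

```python
def dice(a, b):
    """Dice similarity between two sets of fingerprint features."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def atomic_weights(ref_features, probe_feature_atoms, n_atoms):
    """Per-atom contribution to the similarity between a probe and a reference.

    ref_features: set of features present in the reference molecule.
    probe_feature_atoms: dict mapping each probe feature to the set of
        probe atom indices that give rise to it (hypothetical toy input).
    Weight of atom i = similarity with all features
                       - similarity without the features involving atom i.
    """
    base = dice(ref_features, set(probe_feature_atoms))
    weights = []
    for atom in range(n_atoms):
        reduced = {
            feature
            for feature, atoms in probe_feature_atoms.items()
            if atom not in atoms
        }
        weights.append(base - dice(ref_features, reduced))
    return weights


# Toy probe with three atoms; feature "c" is absent from the reference.
reference = {"a", "b"}
probe = {"a": {0}, "b": {0, 1}, "c": {2}}
for atom, weight in enumerate(atomic_weights(reference, probe, 3)):
    print(atom, round(weight, 3))
```

A positive weight marks an atom whose removal lowers the similarity, and a negative weight marks an atom whose removal raises it; in a similarity map these weights would be drawn as colored contours on the structure.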
